Search Results

Documents authored by Gańczorz, Michał


Using Statistical Encoding to Achieve Tree Succinctness Never Seen Before

Authors: Michał Gańczorz

Published in: LIPIcs, Volume 154, 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)


Abstract
We propose new entropy measures for trees. The known ones are H_k(𝒯), the k-th order (tree label) entropy (Ferragina et al. 2005), and the tree entropy H(𝒯) (Jansson et al. 2006); the former considers only the tree labels and the latter only the tree shape. The proposed entropy measures, H_k(𝒯|L) and H_k(L|𝒯), exploit the relation between the labels and the tree shape. We prove that they lower bound the tree entropy and the label entropy, respectively, i.e. H_k(𝒯|L) ≤ H(𝒯) and H_k(L|𝒯) ≤ H_k(L). Besides being theoretically superior, the new measures are significantly smaller in practice. We also propose a new succinct representation of labeled trees which represents a tree T within one of the following bounds: |T|(H(𝒯) + H_k(L|𝒯)) or |T|(H_k(𝒯|L) + H_k(L)). The representation is based on a new, simple method of partitioning the tree which preserves both the tree shape and the node degrees. The previous state-of-the-art method of compressing the tree achieved |T|(H(𝒯) + H_k(L)) bits by combining the results of Ferragina et al. 2005 and Jansson et al. 2006; the proposed representation is thus never worse and often superior. Moreover, our representation supports standard tree navigation in constant time, as well as more complex queries. No structure achieving these space bounds was known before: the aforementioned solution worked for compression alone, while ours is the first that achieves H_k(𝒯) for k > 0 and supports such queries. Lastly, our data structure is fairly simple, both conceptually and in terms of implementation, and it uses known tools, which counters the claim that methods based on tree partitioning are impractical.
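The k-th order label entropy referenced above can be made concrete with a small sketch. The tree encoding here (a parent array plus a label list) and the function name are illustrative assumptions, not the paper's notation; the context of a node is taken to be the labels of its k nearest ancestors, in the spirit of Ferragina et al.'s label entropy:

```python
import math
from collections import Counter, defaultdict

def tree_label_entropy(parent, labels, k):
    """k-th order label entropy of a labeled tree, in bits per node.

    Each node's context is the tuple of labels of its k nearest
    ancestors; the result is the context-weighted zeroth-order entropy
    of node labels. parent[v] is the parent of node v, -1 for the root.
    """
    contexts = defaultdict(Counter)
    for v, lab in enumerate(labels):
        ctx, u = [], parent[v]
        while u != -1 and len(ctx) < k:   # walk up to k ancestors
            ctx.append(labels[u])
            u = parent[u]
        contexts[tuple(ctx)][lab] += 1
    total_bits = 0.0
    for followers in contexts.values():
        n = sum(followers.values())
        # zeroth-order entropy of the labels seen in this context
        total_bits += sum(c * math.log2(n / c) for c in followers.values())
    return total_bits / len(labels)
```

For example, on a star whose root is labeled 'r' and whose four children are labeled 'a', 'a', 'b', 'b', conditioning on the parent label (k = 1) gives a smaller value than the unconditioned entropy (k = 0), illustrating how ancestor context can shrink the measure.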

Cite as

Michał Gańczorz. Using Statistical Encoding to Achieve Tree Succinctness Never Seen Before. In 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 154, pp. 22:1-22:29, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)



@InProceedings{ganczorz:LIPIcs.STACS.2020.22,
  author =	{Ga\'{n}czorz, Micha{\l}},
  title =	{{Using Statistical Encoding to Achieve Tree Succinctness Never Seen Before}},
  booktitle =	{37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)},
  pages =	{22:1--22:29},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-140-5},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{154},
  editor =	{Paul, Christophe and Bl\"{a}ser, Markus},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2020.22},
  URN =		{urn:nbn:de:0030-drops-118836},
  doi =		{10.4230/LIPIcs.STACS.2020.22},
  annote =	{Keywords: succinct data structures, labeled tree, ordered tree, entropy, tree entropy}
}
Entropy Lower Bounds for Dictionary Compression

Authors: Michał Gańczorz

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)


Abstract
We show that a wide class of dictionary compression methods (including LZ77, LZ78, grammar compressors, and parsing-based structures) requires |S|H_k(S) + Ω(|S| k log σ / log_σ |S|) bits to encode their output. This matches known upper bounds and improves the information-theoretic lower bound of |S|H_k(S). To this end, we abstract the crucial properties of the parsings created by those methods, construct a certain family of strings, and analyze the parsings of those strings. We also show that for k = α log_σ |S|, where 0 < α < 1 is a constant, the aforementioned methods produce an output of size at least (1/(1-α)) |S|H_k(S) bits. Our results thus separate dictionary compressors from context-based compressors (such as PPM) and BWT-based ones, as the latter include methods achieving |S|H_k(S) + O(σ^k log σ) bits, i.e. with redundancy that depends on k and σ but not on |S|.
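As a reference point for the |S|H_k(S) terms in the bounds above, here is a minimal sketch of the empirical k-th order entropy of a string (the standard definition; the function name is our own):

```python
import math
from collections import Counter, defaultdict

def kth_order_entropy(s, k):
    """Empirical k-th order entropy H_k(S), in bits per symbol.

    For each length-k context w, take the zeroth-order entropy of the
    symbols that follow occurrences of w in s, weighted by how often
    each context occurs, and normalize by |S|.
    """
    contexts = defaultdict(Counter)
    for i in range(k, len(s)):
        contexts[s[i - k:i]][s[i]] += 1   # symbol s[i] follows context s[i-k:i]
    total_bits = 0.0
    for followers in contexts.values():
        n = sum(followers.values())
        # zeroth-order entropy contribution of this context
        total_bits += sum(c * math.log2(n / c) for c in followers.values())
    return total_bits / len(s)
```

For instance, "abababab" has H_0 = 1 bit per symbol but H_1 = 0, since each symbol is fully determined by its one-symbol context; the lower bound above concerns how closely dictionary compressors can approach such |S|H_k(S) figures.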

Cite as

Michał Gańczorz. Entropy Lower Bounds for Dictionary Compression. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 11:1-11:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)



@InProceedings{ganczorz:LIPIcs.CPM.2019.11,
  author =	{Ga\'{n}czorz, Micha{\l}},
  title =	{{Entropy Lower Bounds for Dictionary Compression}},
  booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
  pages =	{11:1--11:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-103-0},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{128},
  editor =	{Pisanti, Nadia and Pissis, Solon P.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.11},
  URN =		{urn:nbn:de:0030-drops-104822},
  doi =		{10.4230/LIPIcs.CPM.2019.11},
  annote =	{Keywords: compression, empirical entropy, parsing, lower bounds}
}